Parallel LDA Through Synchronized Communication Optimizations
Authors
Abstract
Sophisticated big data machine learning applications are difficult to parallelize: they must not only process a large training dataset but also synchronize large model data across iterations. In parallel LDA, comparing synchronized and asynchronous communication methods under data parallelism and model parallelism, we observe that the power-law distribution of word counts in LDA training datasets suggests that synchronized communication optimizations can improve the efficiency of model updates, allowing the model to converge faster, shrinking the model size, and further reducing computation time in later iterations. We therefore abstracted new synchronized communication operations and developed two new parallel LDA implementations, “lda-lgs” and “lda-rtt”. We compare our new approaches to leading implementations in the field on an Intel Haswell cluster with 100 nodes and 4,000 threads. In data parallelism, “lda-lgs” reaches higher model likelihood in shorter or similar execution time compared with Yahoo! LDA. In model parallelism, when achieving similar model likelihood, “lda-rtt” runs up to 3.9 times faster than Petuum LDA.
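The contrast the abstract draws between synchronized and asynchronous model updates can be illustrated with a minimal sketch. The snippet below is not the paper's lda-lgs or lda-rtt implementation; it only sketches a synchronized update of the word-topic count matrix using an MPI allreduce, with hypothetical names (n_words, n_topics, sample_local_partition) standing in for real training code.

```python
# A minimal sketch of a synchronized model update for parallel LDA.
# All names are hypothetical; this is not the paper's lda-lgs/lda-rtt code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

n_words, n_topics = 10000, 100  # hypothetical model dimensions
word_topic = np.zeros((n_words, n_topics), dtype=np.int64)

def sample_local_partition():
    """Stand-in for a Gibbs sampling sweep over this worker's documents.
    Returns the change in word-topic counts produced by the sweep."""
    return np.zeros((n_words, n_topics), dtype=np.int64)

for iteration in range(10):
    local_delta = sample_local_partition()
    # Synchronized communication: every worker contributes its update and
    # receives the combined update before the next iteration begins, so all
    # workers sample against the same, up-to-date model.
    global_delta = np.empty_like(local_delta)
    comm.Allreduce(local_delta, global_delta, op=MPI.SUM)
    word_topic += global_delta
```

Because word counts in LDA corpora follow a power law, most rows of local_delta stay sparse; a dense allreduce like the one above ignores that, which is the kind of inefficiency the paper's synchronized communication optimizations are designed to exploit.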
Similar resources
High Performance LDA through Collective Model Communication Optimization
LDA is a widely used machine learning technique for big data analysis. The application includes an inference algorithm that iteratively updates a model until it converges. A major challenge in parallelization is scaling, because the model is huge and parallel workers need to communicate it continually. We identify three important features of the model in para...
HarpLDA+: Optimizing Latent Dirichlet Allocation for Parallel Efficiency
Latent Dirichlet Allocation (LDA) is a widely used machine learning technique in topic modeling and data analysis. Training large LDA models on big datasets involves dynamic and irregular computation patterns and is a major challenge to both algorithm optimization and system design. In this paper, we present a comprehensive benchmarking of our novel synchronized LDA training system HarpLDA+ bas...
Optimizations for Parallel Computing Using Data Access Information
Given the large communication overheads characteristic of modern parallel machines, optimizations that eliminate, hide or parallelize communication may improve the performance of parallel computations. This paper describes our experience automatically applying communication optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-le...
Optimizations for Message Driven Applications on Multicore Architectures
With the growing amount of parallelism available on today’s multicore processors, achieving good performance at scale is challenging. We approach this issue through an alternative to traditional thread-based paradigms for writing shared memory programs, namely message driven multicore programming. We study a number of optimizations that improve the efficiency of message driven programs on multi...
Evaluating Compiler Optimizations for Fortran D
The Fortran D compiler uses data decomposition specifications to automatically translate Fortran programs for execution on MIMD distributed-memory machines. This paper introduces and classifies a number of advanced optimizations needed to achieve acceptable performance; they are analyzed and empirically evaluated for stencil computations. Communication optimizations reduce communication overhead...